Today I Learned

14 posts by jasonschweier

Dig-ging into Data

Ruby 2.3 added a couple of dig methods to aid in accessing nested data.

Here's some Hash and Array examples:

hash = {
  animals: {
    furry: ["dog", "cat"],
    spikey: ["porcupine", "echidna"]
  }
}

hash.dig(:animals, :furry)
# => ["dog", "cat"]

# safely returns `nil` for missing keys
hash.dig(:plants, :yellow)
# => nil

array = [1, [[2, 3], 4, 5], 6]

array.dig(1, 0, 1)
# => 3

# dig on arrays take indexes, and repeatly call dig on the result
# thus, you have to be careful if the return type is not an array!
array.dig(2)
# => 6

array.dig(2, 3)
# => TypeError: Fixnum does not have #dig method

# and duck typing means they can be combined!
hash.dig(:animals, :spikey, 1)
# => "echidna"

Finding Where Rake Tasks Are Defined in Rails

Rake has an option to print where a task was loaded from.

$ rake -W gettext:find
rake gettext:find   $HOME/.rvm/gems/ruby-2.3.3/gems/gettext_i18n_rails-1.8.0/lib/gettext_i18n_rails/tasks.rb:64:in `block in <top (required)>'

We can also get this data with code. This allows us to explore the tasks (e.g. filtering). In a Rails console:

# load up all the tasks Rails knows about
> Rails.application.load_tasks
...

# if you know the task you are looking for:
> Rake::Task["gettext:find"].actions.map(&:source_location)
 => [["$HOME/.rvm/gems/ruby-2.3.3/gems/gettext_i18n_rails-1.8.0/lib/gettext_i18n_rails/tasks.rb", 64]]

 # if you want to explore known tasks
 > gettext_tasks = Rake.application.tasks.select { |e| e.name.start_with? "gettext:" }
 => [<Rake::Task gettext:add_language => [environment]>, <Rake::Task gettext:base_without_table_attributes => [environment]>, <Rake::Task gettext:find => [setup]>, <Rake::Task gettext:pack => [setup]>, <Rake::Task gettext:setup => [environment]>, <Rake::Task gettext:store_model_attributes => [environment]>]
 >
 > pp Hash[gettext_tasks.map { |t| [t.name, t.locations] }]
 {"gettext:add_language"=>
  ["$HOME/.rvm/gems/ruby-2.3.3/gems/gettext_i18n_rails-1.8.0/lib/gettext_i18n_rails/tasks.rb:99:in `block in <top (required)>'"],
 "gettext:base_without_table_attributes"=>
  ["$HOME/src/packmanager/master/lib/tasks/gettext.rake:27:in `block in <top (required)>'"],
 "gettext:find"=>
  ["$HOME/.rvm/gems/ruby-2.3.3/gems/gettext_i18n_rails-1.8.0/lib/gettext_i18n_rails/tasks.rb:64:in `block in <top (required)>'"],
 "gettext:pack"=>
  ["$HOME/.rvm/gems/ruby-2.3.3/gems/gettext_i18n_rails-1.8.0/lib/gettext_i18n_rails/tasks.rb:59:in `block in <top (required)>'"],
 "gettext:setup"=>
  ["$HOME/.rvm/gems/ruby-2.3.3/gems/gettext_i18n_rails-1.8.0/lib/gettext_i18n_rails/tasks.rb:43:in `block in <top (required)>'"],
 "gettext:store_model_attributes"=>
  ["$HOME/.rvm/gems/ruby-2.3.3/gems/gettext_i18n_rails-1.8.0/lib/gettext_i18n_rails/tasks.rb:82:in `block in <top (required)>'"]}

ORDER BY NULL Sorting Options

The ORDER BY clause has options for placement of NULL columns. From the Postgres docs, the ORDER BY grammar is:

[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]

It shows we can put the NULL columns first or last.

Example

Imagine we have a datetime range, implemented as (start_at, end_at) tuples.

CREATE TABLE dates (id INT, start_at TIMESTAMP, end_at TIMESTAMP);

INSERT INTO dates VALUES 
(1, TIMESTAMP '2018-01-01 9:45', TIMESTAMP '2018-01-01 10:00'),
(2, TIMESTAMP '2018-01-01 10:00', null),
(3, TIMESTAMP '2018-01-01 10:15', null);

We want to order by end_at, then start_at. However, we want the NULL end_at tuples to appear before any with a value.

By default, NULL values will appear after non-NULL. With the NULLS FIRST option, we can change that:

SELECT * FROM dates ORDER BY end_at NULLS FIRST, start_at, id;

╒══════╤═════════════════════╤═════════════════════╕
│ id   │ start_at            │ end_at              │
╞══════╪═════════════════════╪═════════════════════╡
│ 2    │ 2018-01-01 10:00:00 │ <null>              │
├──────┼─────────────────────┼─────────────────────┤
│ 3    │ 2018-01-01 10:15:00 │ <null>              │
├──────┼─────────────────────┼─────────────────────┤
│ 1    │ 2018-01-01 09:45:00 │ 2018-01-01 10:00:00 │
╘══════╧═════════════════════╧═════════════════════╛

You can also specify NULLS when creating indexes too.

Safe Navigation Operator

If you have worked in a Rails project, you have probably came across the try and try! methods. try saves us from checking conditions before drilling-down through a message chain. try! is more restrictive; it raises when the receiver does not respond to the method:

require "active_support"
require "active_support/core_ext/object/try"

User = Struct.new(:name)

users = [User.new("Jason")]

users.first.try(:name)
# "Jason"

users.first.try(:unknown_method)
# nil

users.first.try!(:unknown_method)
# NoMethodError: undefined method `unknown_method' for #<struct User name="Jason">

Note these methods are provided by ActiveSupport. As of Ruby 2.3, this behaviour is now available in the language, called the safe navigation operator &.:

users = [User.new(name: "Jason")]
users.first&.name
# "Jason"

users = []
users.first&.name
# nil

Gotchas

When an object does not respond to a method, &. behaves like try!; it raises an error. However, nil seems to break this pattern:

# behaviour like try!
user&.unknown_method
# NoMethodError: undefined method `unknown_method' for #<struct User name="Jason">

# given that
nil.nil?
# true

# this is confusing
nil&.nil?
# nil

The last example has a simple explanation: &. checks the receiver. Since the receiver is nil, &. immediately returns nil. It does not matter that nil responds to the message.

Block Kwargs

Ruby 2.0 introduced keyword arguments. Ruby 2.1 further added required keyword arguments. It is common to see methods using kwargs:

def some_method(keyword_arg: "Hello", required_arg:)
  puts "#{keyword_arg} #{required_arg}"
end

some_method(required_arg: "world")
# Hello world

However, these changes also apply to block arguments:

define_method :prints_block_arguments do |default_arg: "hello", required_arg:, **others|
  puts "default_arg: #{default_arg}"
  puts "required_arg: #{required_arg}"

  others.each_pair do |key, value|
    puts "other arg: #{key} => #{value}"
  end
end

prints_block_arguments(required_arg: "world")
# default_arg: hello
# required_arg: world

prints_block_arguments(default_arg: "ciao", required_arg: "mondo", something: "else")
# default_arg: ciao
# required_arg: mondo
# other arg: something => else

prints_block_agruments()
# ArgumentError: missing keyword: required_arg

Git Shortlog

There are many ways to customize your git log output. However, the structure still the same: commit by commit.

A new variation I recently discovered was git shortlog. It displays commits by author, rather than most recently committed order. It is intended to help producing release notes.

This lets you find out who your largest contributors are, especially with the --summary option.

Try this:

$ git shortlog --summary | sort -r

Searching Through Gems

Sometimes you need to peak at the source of a class or method of a gem from your Gemfile. Since gems are managed by bundler, it has a command-line option that can help.

bundle show --paths will list the paths of all the gems from your Gemfile. You can grep through these directories to search.

I'm currently using ripgrep for searching at the command line, so can pass the paths there.

Here’s it all together in a simple Bash function:

function bundle-search() {
    rg $1 `bundle show --paths`
}

Enumerable Predicate Methods Have a New Argument

Enumerable#grep accepts an argument that returns elements that are true using case equality (===). These elements can then be passed to its block for further proccessing:

fruits = %w(apples orange grapes)
fruits.grep(/s$/) # => ["apples", "grapes"]
fruits.grep(/s$/) { |e| e.start_with?("a") } # => [true, false]

In Ruby 2.5, this parameter argument was added to the Enumerable predicate methods all?, none?, one?, and any?.

fruits = %w(apples orange grapes)
fruit.any?(/s$/) # => true

# === works with more than Regexes!
[1, 3.14, 2ri].all?(Numeric) # => true
[0, 1, 2, 3].all?(1..10) # => false
[0, 5].one?(1..3) # => false
[0, 5].none?(1..3) # => true

More Expressive Regular Expressions with Ruby

I came across some parsing code that contained this regular expression:

# the first capture is the left quote
# the second capture is the string content
# the third capture is the right quote
METHOD_CAPTURE = /Module\.method\(\s*(['"])(.+?)(['"])\s*\)/

This regexp is matching the string contents of a particular method call. It captures three pieces of information, mentioned in the comments.

This code is not self-documenting. We have three comments explaining the captures, and a regexp we must parse in wetware.

Furthermore, numerical indexes are difficult to track and modify in complicated regexps. PCRE supports named captures; denoted in Ruby by (?<name>pattern).

To make the comments superfluous, let's refactor using the following:

  1. Use named capture groups
  2. Use interpolation to give descriptive names to common components
  3. Don't capture needlessly (in our case, the captured quote characters are not used)
  4. Prefer [] over escaping (personal preference)
quote_character = %q|['"]|
optional_whitespace = "\s*"

METHOD_CAPTURE = /Module[.]method[(]#{optional_whitespace}#{quote_character}(?<string_contents>.+?)#{quote_character}#{optional_whitespace}[)]/

# #match returns an object or nil
matches = 'Module.method("some string")'.match(METHOD_CAPTURE)

# if an object, access captures by name:
matches[:string_contents] # "some string"

You can tailor how many named expressions/variables to use based on the complexity of the regexp.

Diff CSVs With Ease

We're all familiar with diff tools like RubyMine, vimdiff, or the good old diff command. These work well for source code but are not optimal for CSVs files, which may have many columns and rows.

I found myself wanting to write CSV-specific diff tool, but I fought that Not Invented Here impulse and Googled first. I found a good Python library called csvdiff.

Imagine we have the following CSV file:

id,name
1,Alan Turing
2,John McCarthy
3,Edger Djikstra

We want to compare it to another file:

id,name
1,Alan Turing
3,Edsger Dijkstra
4,John Von Neumann

We've removed, edited, and added a row. The summary option confirms our changes:

$ csvdiff --style=summary id one.csv two.csv
1 rows removed (33.3%)
1 rows added (33.3%)
1 rows changed (33.3%)

The detailed view gives us all the gory details:

$ csvdiff --style=pretty id one.csv two.csv
{
  "_index": [
    "id"
  ],
  "added": [
    {
      "id": "4",
      "name": "Jon Von Neumann"
    }
  ],
  "changed": [
    {
      "fields": {
        "name": {
          "from": "Edger Djikstra",
          "to": "Edsger Dijkstra"
        }
      },
      "key": [
        "3"
      ]
    }
  ],
  "removed": [
    {
      "id": "2",
      "name": "John McCarthy"
    }
  ]
}

Squish Those Strings

While reading some Rails code, I came across a deprecation warning:

ActiveSupport::Deprecation.warn(<<-MESSAGE.squish)
  `redirect_to :back` is deprecated and will be removed from Rails 5.1.
  Please use `redirect_back(fallback_location: fallback_location)` where
  `fallback_location` represents the location to use if the request has
  no HTTP referer information.
MESSAGE

What caught my eye was the squish method on the heredoc. It's common to see methods after heredocs to clean up formatting, but squish is a great method name.

squish removes all leading and trailing whitespace, then replaces all consecutive whitespace with a single space. One application has been cleaning up long string messages that use the line continuation operator. The community style guide says to only use line continuations for concatenating strings, but I think squish is cleaner:

long_string = "a long string " \
              "spanning " \
              "three lines"
# => "a long string spanning three lines"

better_long_string = "a long string
                      squished to a single line
                      without extra spaces or backslashes".squish
# => "a long string squished to a single line without extra spaces or backslashes"

An all-too-common caveat: squish is an extenstion method from active_support.

Using Trigram Indexes to Speed Up LIKE Queries

Queries containing ILIKE '%anything%' result in table scans since they have a leading wildcard. Typical btree indexes can not help improve performance in this case.

Trigrams are 3-character slices of words:

SELECT show_trgm('Ruby');
-- {"  r"," ru","by ",rub,uby}

They are useful as indexes since they can speed up LIKE, ILIKE, ~, and ~* clauses.

Let's query a table of 14K email addresses:

CREATE TABLE emails (email TEXT);

\copy emails FROM 'emails.csv' DELIMITER ',' CSV;

EXPLAIN ANALYZE SELECT * FROM emails WHERE email ILIKE '%ion%';

The query plan shows a table scan:

Seq Scan on emails
  Filter: (email ~~* '%ion%'::text)
  Rows Removed by Filter: 14040

with an average execution time of 15ms. Let's create a trigram index and try again:

-- postgres 9.2+
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX emails_search_email_idx ON emails USING GIN (email gin_trgm_ops);

EXPLAIN ANALYZE SELECT * FROM emails WHERE email ILIKE '%ion%';

The improved query plan is:

Bitmap Heap Scan on emails
  Recheck Cond: (email ~~* '%ion%'::text)
  Heap Blocks: exact=98
    -> Bitmap Index Scan on emails_search_email_idx
       Index Cond: (email ~~* '%ion%'::text)

and averages 0.8ms. A 19X improvement without updating any SQL queries! YMMV

SQL's WITH RECURSIVE Query

While optimizing calls to a recursive table, we found a neat SQL solution. It uses a common table expression as a working table to query against iteratively.

Here's an example of using WITH RECURSIVE with a modified nested set example of clothing categories that find all paths through the categories:

CREATE TEMPORARY TABLE categories (id INT, name text, parent_category_id INT);

INSERT INTO categories VALUES
  (1, 'Clothing', null),
  (2, 'Mens''s', 1),
  (3, 'Women''s', 1),
  (4, 'Suits', 2),
  (5, 'Dresses', 3),
  (6, 'Skirts', 3),
  (7, 'Jackets', 4),
  (8, 'Evening Gowns', 5);

WITH RECURSIVE category_hierarchies AS
(SELECT id, parent_category_id, name AS full_path
 FROM categories
 WHERE parent_category_id is NULL

 UNION ALL

 SELECT child_categories.id,
        child_categories.parent_category_id,
        parent_categories.full_path || ' -> ' || child_categories.name as full_path
 FROM categories AS child_categories
 INNER JOIN category_hierarchies AS parent_categories
   ON child_categories.parent_category_id = parent_categories.id
)
SELECT full_path FROM category_hierarchies ORDER BY full_path;

Produces paths through all categories:

  • Clothing
  • Clothing -> Mens's
  • Clothing -> Mens's -> Suits
  • Clothing -> Mens's -> Suits -> Jackets
  • Clothing -> Women's
  • Clothing -> Women's -> Dresses
  • Clothing -> Women's -> Dresses -> Evening Gowns
  • Clothing -> Women's -> Skirts

Read more about WITH RECURSIVE queries

Comparing Version Strings in Ruby

While writing a Ruby script, I needed to check the the version of a binary dependancy. The --version switch gets me the data, but how to compare to the required version?

The binary follows semver, so a quick and dirty attempt might be:

"1.4.2".gsub(".", "") >= "1.3.1".gsub(".", "")
# => true

Unfortunately, this is misleading: we are lexicographically comparing the strings and these strings happen to have the same length. Thus, "142" comes after "131".

Testing that version "1.200.0" is newer than "1.9.0" will fail as "120" comes before "190".

It would be straight-forward to write a small class to parse the string and compare the major, minor, and patch values. But, Ruby has a quick solution provided by RubyGems. Since Ruby 1.9, RubyGems has been included in Ruby's standard library:

Gem::Version.new("1.200.1") >= Gem::Version.new("1.3.1")
# => true

Gem also provides a way handle pessimistic constraints:

dependency = Gem::Dependency.new("", "~> 1.3.1")
dependency.match?("", "1.3.9")
# => true
dependency.match?("", "1.4.1")
# => false