Today I Learned

5 posts by jasonschweier

Diff CSVs With Ease

We're all familiar with diff tools like RubyMine, vimdiff, or the good old diff command. These work well for source code but are not optimal for CSVs files, which may have many columns and rows.

I found myself wanting to write CSV-specific diff tool, but I fought that Not Invented Here impulse and Googled first. I found a good Python library called csvdiff.

Imagine we have the following CSV file:

id,name
1,Alan Turing
2,John McCarthy
3,Edger Djikstra

We want to compare it to another file:

id,name
1,Alan Turing
3,Edsger Dijkstra
4,John Von Neumann

We've removed, edited, and added a row. The summary option confirms our changes:

$ csvdiff --style=summary id one.csv two.csv
1 rows removed (33.3%)
1 rows added (33.3%)
1 rows changed (33.3%)

The detailed view gives us all the gory details:

$ csvdiff --style=pretty id one.csv two.csv
{
  "_index": [
    "id"
  ],
  "added": [
    {
      "id": "4",
      "name": "Jon Von Neumann"
    }
  ],
  "changed": [
    {
      "fields": {
        "name": {
          "from": "Edger Djikstra",
          "to": "Edsger Dijkstra"
        }
      },
      "key": [
        "3"
      ]
    }
  ],
  "removed": [
    {
      "id": "2",
      "name": "John McCarthy"
    }
  ]
}

Squish Those Strings

While reading some Rails code, I came across a deprecation warning:

ActiveSupport::Deprecation.warn(<<-MESSAGE.squish)
  `redirect_to :back` is deprecated and will be removed from Rails 5.1.
  Please use `redirect_back(fallback_location: fallback_location)` where
  `fallback_location` represents the location to use if the request has
  no HTTP referer information.
MESSAGE

What caught my eye was the squish method on the heredoc. It's common to see methods after heredocs to clean up formatting, but squish is a great method name.

squish removes all leading and trailing whitespace, then replaces all consecutive whitespace with a single space. One application has been cleaning up long string messages that use the line continuation operator. The community style guide says to only use line continuations for concatenating strings, but I think squish is cleaner:

long_string = "a long string " \
              "spanning " \
              "three lines"
# => "a long string spanning three lines"

better_long_string = "a long string
                      squished to a single line
                      without extra spaces or backslashes".squish
# => "a long string squished to a single line without extra spaces or backslashes"

An all-too-common caveat: squish is an extenstion method from active_support.

Using Trigram Indexes to Speed Up LIKE Queries

Queries containing ILIKE '%anything%' result in table scans since they have a leading wildcard. Typical btree indexes can not help improve performance in this case.

Trigrams are 3-character slices of words:

SELECT show_trgm('Ruby');
-- {"  r"," ru","by ",rub,uby}

They are useful as indexes since they can speed up LIKE, ILIKE, ~, and ~* clauses.

Let's query a table of 14K email addresses:

CREATE TABLE emails (email TEXT);

\copy emails FROM 'emails.csv' DELIMITER ',' CSV;

EXPLAIN ANALYZE SELECT * FROM emails WHERE email ILIKE '%ion%';

The query plan shows a table scan:

Seq Scan on emails
  Filter: (email ~~* '%ion%'::text)
  Rows Removed by Filter: 14040

with an average execution time of 15ms. Let's create a trigram index and try again:

-- postgres 9.2+
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX emails_search_email_idx ON emails USING GIN (email gin_trgm_ops);

EXPLAIN ANALYZE SELECT * FROM emails WHERE email ILIKE '%ion%';

The improved query plan is:

Bitmap Heap Scan on emails
  Recheck Cond: (email ~~* '%ion%'::text)
  Heap Blocks: exact=98
    -> Bitmap Index Scan on emails_search_email_idx
       Index Cond: (email ~~* '%ion%'::text)

and averages 0.8ms. A 19X improvement without updating any SQL queries! YMMV

SQL's WITH RECURSIVE Query

While optimizing calls to a recursive table, we found a neat SQL solution. It uses a common table expression as a working table to query against iteratively.

Here's an example of using WITH RECURSIVE with a modified nested set example of clothing categories that find all paths through the categories:

CREATE TEMPORARY TABLE categories (id INT, name text, parent_category_id INT);

INSERT INTO categories VALUES
  (1, 'Clothing', null),
  (2, 'Mens''s', 1),
  (3, 'Women''s', 1),
  (4, 'Suits', 2),
  (5, 'Dresses', 3),
  (6, 'Skirts', 3),
  (7, 'Jackets', 4),
  (8, 'Evening Gowns', 5);

WITH RECURSIVE category_hierarchies AS
(SELECT id, parent_category_id, name AS full_path
 FROM categories
 WHERE parent_category_id is NULL

 UNION ALL

 SELECT child_categories.id,
        child_categories.parent_category_id,
        parent_categories.full_path || ' -> ' || child_categories.name as full_path
 FROM categories AS child_categories
 INNER JOIN category_hierarchies AS parent_categories
   ON child_categories.parent_category_id = parent_categories.id
)
SELECT full_path FROM category_hierarchies ORDER BY full_path;

Produces paths through all categories:

  • Clothing
  • Clothing -> Mens's
  • Clothing -> Mens's -> Suits
  • Clothing -> Mens's -> Suits -> Jackets
  • Clothing -> Women's
  • Clothing -> Women's -> Dresses
  • Clothing -> Women's -> Dresses -> Evening Gowns
  • Clothing -> Women's -> Skirts

Read more about WITH RECURSIVE queries

Comparing Version Strings in Ruby

While writing a Ruby script, I needed to check the the version of a binary dependancy. The --version switch gets me the data, but how to compare to the required version?

The binary follows semver, so a quick and dirty attempt might be:

"1.4.2".gsub(".", "") >= "1.3.1".gsub(".", "")
# => true

Unfortunately, this is misleading: we are lexicographically comparing the strings and these strings happen to have the same length. Thus, "142" comes after "131".

Testing that version "1.200.0" is newer than "1.9.0" will fail as "120" comes before "190".

It would be straight-forward to write a small class to parse the string and compare the major, minor, and patch values. But, Ruby has a quick solution provided by RubyGems. Since Ruby 1.9, RubyGems has been included in Ruby's standard library:

Gem::Version.new("1.200.1") >= Gem::Version.new("1.3.1")
# => true

Gem also provides a way handle pessimistic constraints:

dependency = Gem::Dependency.new("", "~> 1.3.1")
dependency.match?("", "1.3.9")
# => true
dependency.match?("", "1.4.1")
# => false