Last updated: May 16, 2014
Elasticsearch

Installing Elasticsearch and Inserting Data from CSV

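I want to use a full-text search system from a Rails application, so I am trying out Elasticsearch. Like Solr, Elasticsearch is built on Apache Lucene, so it needs a Java environment to run. This post is an introduction: it covers installation and goes as far as importing data from a CSV file.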

Elasticsearch.org Open Source Distributed Real Time Search & Analytics | Elasticsearch

For Japanese-language information on Elasticsearch, the following articles were very helpful for getting started.

Elasticsearchチュートリアル – 不可視点
 実践！Elasticsearch – Wantedly Engineer Blog
Kuromojiで日本語全文検索 – AWSで始めるElasticSearch(1) ｜ Developers.IO

Thank you to the authors.

Installing Elasticsearch and plugins, and checking that it works

First, I installed Elasticsearch itself in my local Mac development environment. Homebrew installed it in one step.


$ brew install elasticsearch
==> Downloading https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.tar.gz

Next, I installed a few plugins that looked useful.

Plugins

Plugins are installed with the /usr/local/bin/plugin command. I have /usr/local/bin on my PATH, so from here on I just type plugin.

Install Marvel, a plugin for management and monitoring.

Marvel Documentation


$ plugin -install elasticsearch/marvel/latest

Install the Inquisitor plugin, which looks handy for managing and debugging queries.

https://github.com/polyfractal/elasticsearch-inquisitor


$ plugin -install polyfractal/elasticsearch-inquisitor

Install the Elasticsearch Head plugin, a web front end.

https://github.com/mobz/elasticsearch-head


$ sudo plugin -install mobz/elasticsearch-head

Install the Kuromoji plugin, which is essential for analyzing Japanese text.

https://github.com/elasticsearch/elasticsearch-analysis-kuromoji


$ plugin -install elasticsearch/elasticsearch-analysis-kuromoji/2.0.0

I will try importing CSV data later, so install the CSV River plugin as well.

https://github.com/AgileWorksOrg/elasticsearch-river-csv


$ bin/plugin -install river-csv -url https://github.com/AgileWorksOrg/elasticsearch-river-csv/releases/download/2.0.1/elasticsearch-river-csv-2.0.1.zip

That is enough plugins for now.

Once all of that is done, start elasticsearch.


$ elasticsearch

Open http://127.0.0.1:9200/ and check that the response contains "status": 200. If you installed the head plugin, http://127.0.0.1:9200/_plugin/head/ should also show the head interface.
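The same check can be scripted. This is only a minimal sketch, assuming the root endpoint returns a JSON object with a numeric "status" field as described above. The helper names elasticsearch_ok? and fetch_root, and the sample body, are made up for illustration.

```ruby
require "json"
require "net/http"
require "uri"

# Returns true when the JSON body from the root endpoint reports "status": 200.
def elasticsearch_ok?(body)
  data = JSON.parse(body)
  data.is_a?(Hash) && data["status"] == 200
rescue JSON::ParserError
  false
end

# Fetches the root endpoint of a local node (needs a running Elasticsearch).
def fetch_root(url = "http://127.0.0.1:9200/")
  Net::HTTP.get(URI(url))
end

# Hypothetical, abbreviated response body used only to exercise the check.
sample_body = '{"status": 200, "name": "example-node"}'
puts elasticsearch_ok?(sample_body)
```

With a node running, elasticsearch_ok?(fetch_root) would perform the real check.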

Downloading data to load into Elasticsearch

Following the Elasticsearchチュートリアル – 不可視点 page, I decided to use the Livedoor Gourmet data. I did the work below in the ~/work directory.


$ cd ~/work
$ wget https://github.com/livedoor/datasets/raw/master/ldgourmet.tar.gz
$ tar xvfz ldgourmet.tar.gz
x areas.csv
x categories.csv
x prefs.csv
x ratings.csv
...

The CSV files ended up scattered directly under the working directory, so I gathered them into one folder.


$ mkdir ldgourmet
$ mv *.csv ldgourmet

I will use restaurants.csv. First, check how the data is structured.


$ vi ldgourmet/restaurants.csv
id,name,property,alphabet,name_kana,pref_id,area_id,station_id1,station_time1,station_distance1,station_id2,station_time2,station_distance2,station_id3,station_time3,station_distance3,category_id1,category_id2,category_id3,category_id4,category_id5,zip,address,north_latitude,east_longitude,description,purpose,open_morning,open_lunch,open_late,photo_count,special_count,menu_count,fan_count,access_count,created_on,modified_on,closed
2,"ラ・マーレ・ド・茶屋","2F・3F","LA MAREE DE CHAYA","らまーれどちゃや",14,1013,2338,22,1789,2401,28,2240,2867,47,3755,201,0,0,0,0,240-0113,"三浦郡葉山町堀内24-3",35.16.53.566,139.34.20.129,"こちら2.3Ｆのレストランへのコメントになります。  『ラ・マーレ・ド・茶屋』1F(テラス&バー)へのコメントはそちらにお願いします。    駐車場15台(専用)    06/06/19　営業時間等更新（From東京グルメ）",,0,1,0,1,0,0,5,6535,"2000-09-10 11:22:02","2011-04-22 16:05:12",0
...

The first line is a header row.
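Ruby's standard CSV library can treat that first line as column names. Here is a small sketch using an inline stand-in for restaurants.csv that has only the first three columns and one row.

```ruby
require "csv"

# Tiny stand-in for restaurants.csv: only the first three columns, one row.
sample = <<~CSV
  id,name,property
  2,"ラ・マーレ・ド・茶屋","2F・3F"
CSV

# headers: true makes the first line the column names instead of a data row.
table = CSV.parse(sample, headers: true)

puts table.headers.inspect
table.each do |row|
  puts "#{row["id"]}: #{row["name"]} (#{row["property"]})"
end
```

Each row can then be read by column name rather than by position.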

Defining a schema in Elasticsearch

Before actually loading the restaurants.csv data into Elasticsearch, create a schema that defines how Elasticsearch should handle the data.

This requires some concepts that never come up in MySQL, such as N-grams and analyzers. If they are new to you, reading the following article first will make the rest easier.

ビッグデータ処理の常識をJavaで身につける（1）：検索エンジンの常識をApache Solrで身につける (1/4) – ＠IT

At first I had no idea what N-grams, analyzers, tokenizers, or inverted indexes were, but this ＠IT article made the concepts easy to grasp. It is written about Apache Solr, but the same background knowledge applies to Elasticsearch.

Now, back to the Elasticsearch schema definition.

The diagram of Elasticsearch's data structure on the Elasticsearchチュートリアル – 不可視点 page is very easy to follow. In MySQL terms, an index corresponds to a database and a type corresponds to a table.

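In an Elasticsearch schema, the mappings property lists, for each type, the property names (column names in MySQL terms), their data types, and their analyzers (described below). Defining mappings is close to specifying column names and data types in a MySQL CREATE TABLE statement.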

The other important part of the schema is the analysis property, which holds the filter, tokenizer, and analyzer definitions. A filter handles things such as stop words; a tokenizer specifies how text is split, for example by N-gram or by morphological analysis; and an analyzer combines a tokenizer with filters to make a custom analyzer. You can define several of each.
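To get a feel for what an N-gram tokenizer produces, here is a rough Ruby sketch. It is not how Elasticsearch implements it, and the order of the tokens may differ from the real tokenizer. It assumes the settings used in schema.json in this post: a minimum length of 2, a maximum of 3, and only letters and digits kept as token characters. The function name char_ngrams is made up.

```ruby
# Splits text on anything that is not a letter or digit, then emits every
# substring of length min..max from each chunk (character N-grams).
def char_ngrams(text, min: 2, max: 3)
  chunks = text.split(/[^\p{L}\p{Nd}]+/).reject(&:empty?)
  chunks.flat_map do |chunk|
    chars = chunk.chars
    (min..max).flat_map do |n|
      chars.each_cons(n).map(&:join)
    end
  end
end

p char_ngrams("chaya")
p char_ngrams("茶屋 2F")
```

Because every short substring becomes a token, a partial match such as "hay" can find "chaya". A morphological tokenizer such as Kuromoji splits Japanese text into words instead.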

Defining the Elasticsearch schema from restaurants.csv

Now back to restaurants.csv, to define the matching Elasticsearch schema. For the analysis property I followed 実践！Elasticsearch – Wantedly Engineer Blog, and I wrote the mappings property while staring at restaurants.csv.


$ vi schema.json
{
  "settings": {
    "analysis": {
      "filter": {
        "pos_filter": {
          "type": "kuromoji_part_of_speech",
          "stoptags": [
            "助詞-格助詞-一般",
            "助詞-終助詞"
          ]
        },
        "greek_lowercase_filter": {
          "type": "lowercase",
          "language": "greek"
        }
      },
      "tokenizer": {
        "kuromoji": {
          "type": "kuromoji_tokenizer"
        },
        "ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      },
      "analyzer": {
        "kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji",
          "filter": [
            "kuromoji_baseform",
            "pos_filter",
            "greek_lowercase_filter",
            "cjk_width"
          ]
        },
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "restaurant": {
      "_source": {
        "enabled": true
      },
      "_all": {
        "enabled": true,
        "analyzer": "kuromoji_analyzer"
      },
      "properties": {
        "id": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "name": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "property": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "alphabet": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "name_kana": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "pref_id": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "area_id": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_id1": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_time1": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_distance1": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_id2": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_time2": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_distance2": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_id3": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_time3": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "station_distance3": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "category_id1": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "category_id2": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "category_id3": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "category_id4": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "category_id5": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "zip": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "address": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "kuromoji_analyzer"
        },
        "north_latitude": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "east_longitude": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "description": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "kuromoji_analyzer"
        },
        "purpose": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "open_morning": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "open_lunch": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "open_late": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "photo_count": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "special_count": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "menu_count": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "fan_count": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "access_count": {
          "type": "integer",
          "index": "not_analyzed"
        },
        "created_on": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "modified_on": {
          "type": "string",
          "index": "analyzed",
          "analyzer": "ngram_analyzer"
        },
        "closed": {
          "type": "integer",
          "index": "not_analyzed"
        }
      }
    }
  }
}
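
The properties section above is long but very regular, so it could also be generated instead of typed by hand. The following is only a sketch: the grouping of columns mirrors the schema above, while the constant and method names are made up for illustration.

```ruby
require "json"

# Column groups, mirroring how the schema above assigns types and analyzers.
STATION_FIELDS = (1..3).flat_map do |i|
  ["station_id#{i}", "station_time#{i}", "station_distance#{i}"]
end
CATEGORY_FIELDS = (1..5).map { |i| "category_id#{i}" }
INTEGER_FIELDS = %w[id pref_id area_id] + STATION_FIELDS + CATEGORY_FIELDS +
                 %w[open_morning open_lunch open_late photo_count special_count
                    menu_count fan_count access_count closed]
KUROMOJI_FIELDS = %w[address description]
NGRAM_FIELDS = %w[name property alphabet name_kana zip north_latitude
                  east_longitude purpose created_on modified_on]

# Mapping entry for an analyzed string field using the given analyzer.
def analyzed_string(analyzer)
  { "type" => "string", "index" => "analyzed", "analyzer" => analyzer }
end

# Builds the "properties" hash for the restaurant type.
def build_properties
  props = {}
  INTEGER_FIELDS.each do |field|
    props[field] = { "type" => "integer", "index" => "not_analyzed" }
  end
  NGRAM_FIELDS.each { |field| props[field] = analyzed_string("ngram_analyzer") }
  KUROMOJI_FIELDS.each { |field| props[field] = analyzed_string("kuromoji_analyzer") }
  props
end

puts JSON.pretty_generate("properties" => build_properties)
```

The key order differs from the hand-written file, which does not matter for a JSON object.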

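Since this is just an introduction, I did not create any parent-child types and made a single type for restaurants.csv. For the property types (the data types of what would be columns), I used integer and string. I originally used geo_point and date as well, but they raised exceptions when indexing the data from CSV (described later), so I dropped them. The data types available in Elasticsearch are listed here: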

Core Types
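
I did not track down why geo_point and date failed. One possible cause (not verified here) is the format of the values in the CSV. The sketch below shows how they might be normalized before indexing. It assumes north_latitude and east_longitude are written as degrees.minutes.seconds, which is my reading of the sample row and not something I confirmed. It does not handle any geodetic datum conversion, and the function names are made up.

```ruby
# Converts "degrees.minutes.seconds" (seconds may have a fraction) to decimal degrees.
# Assumption: the CSV value uses this layout, e.g. "35.16.53.566".
def dms_to_decimal(dms)
  degrees, minutes, seconds = dms.split(".", 3)
  degrees.to_f + minutes.to_f / 60 + seconds.to_f / 3600
end

# Rewrites "YYYY-MM-DD HH:MM:SS" as "YYYY-MM-DDTHH:MM:SS".
def to_iso8601(timestamp)
  timestamp.strip.sub(" ", "T")
end

lat = dms_to_decimal("35.16.53.566")
lon = dms_to_decimal("139.34.20.129")
puts format("%.6f, %.6f", lat, lon)
puts to_iso8601("2000-09-10 11:22:02")
```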

Create the index from schema.json. I named the index ldgourmet. This can be done with curl against the REST API.


$ curl -XPOST localhost:9200/ldgourmet -d @schema.json
{"acknowledged":true}%

The response says acknowledged is true, so it seems to have worked. Opening http://127.0.0.1:9200/_plugin/head/ showed an index named ldgourmet.
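
Since the goal is to use this from Rails eventually, here is a rough Ruby equivalent of the curl call above, using only the standard library. It is a sketch, not something I ran against the index in this post. The method names are made up, and the schema string in the example is a placeholder rather than the real schema.json.

```ruby
require "net/http"
require "json"

# Builds (but does not send) the POST request that creates an index from a schema.
def build_create_index_request(index_name, schema_json)
  request = Net::HTTP::Post.new("/#{index_name}")
  request.body = schema_json
  request
end

# Sends the request to a local node and reports whether it was acknowledged.
# Needs a running Elasticsearch, so it is not called in this example.
def create_index(index_name, schema_json, host: "localhost", port: 9200)
  response = Net::HTTP.start(host, port) do |http|
    http.request(build_create_index_request(index_name, schema_json))
  end
  JSON.parse(response.body)["acknowledged"] == true
end

# Placeholder schema, only to show the request that would be sent.
request = build_create_index_request("ldgourmet", '{"settings": {}}')
puts "#{request.method} #{request.path}"
```

With a node running, create_index("ldgourmet", File.read("schema.json")) would do the same job as the curl command.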

Loading restaurants.csv into Elasticsearch with the CSV River plugin

Now that the ldgourmet index exists, created from the schema definition, it is finally time to feed the data from restaurants.csv into it (that is, to index it). I used the CSV River plugin installed at the beginning.

https://github.com/AgileWorksOrg/elasticsearch-river-csv

Create a JSON file that describes the import.


$ vi import_data.json
{
  "type" : "csv",
  "csv_file" : {
    "folder" : "/Users/username/work/ldgourmet",
    "filename_pattern" : "restaurants\\.csv$",
    "fields" : [
      "id",
      "name",
      "property",
      "alphabet",
      "name_kana",
      "pref_id",
      "area_id",
      "station_id1",
      "station_time1",
      "station_distance1",
      "station_id2",
      "station_time2",
      "station_distance2",
      "station_id3",
      "station_time3",
      "station_distance3",
      "category_id1",
      "category_id2",
      "category_id3",
      "category_id4",
      "category_id5",
      "zip",
      "address",
      "north_latitude",
      "east_longitude",
      "description",
      "purpose",
      "open_morning",
      "open_lunch",
      "open_late",
      "photo_count",
      "special_count",
      "menu_count",
      "fan_count",
      "access_count",
      "created_on",
      "modified_on",
      "closed"
    ],
    "first_line_is_header" : "true",
    "field_separator" : ",",
    "quote_character" : "\"",
    "field_id" : "id",
    "concurrent_requests" : "1"
  },
  "index" : {
    "index" : "ldgourmet",
    "type" : "restaurant",
    "bulk_size" : 100,
    "bulk_threshold" : 10
  }
}
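
The fields list in this file just repeats the CSV header, so it could be generated from the header line instead of typed out. The sketch below copies the setting names and values from the file above. The method name river_config is made up, and the three-column header is a shortened stand-in for the real one.

```ruby
require "csv"
require "json"

# Builds the CSV River settings hash, taking the field list from a CSV header line.
def river_config(header_line, folder:, index:, type:)
  fields = CSV.parse_line(header_line)
  {
    "type" => "csv",
    "csv_file" => {
      "folder" => folder,
      "filename_pattern" => "restaurants\\.csv$",
      "fields" => fields,
      "first_line_is_header" => "true",
      "field_separator" => ",",
      "quote_character" => "\"",
      "field_id" => "id",
      "concurrent_requests" => "1"
    },
    "index" => {
      "index" => index,
      "type" => type,
      "bulk_size" => 100,
      "bulk_threshold" => 10
    }
  }
end

# Shortened stand-in for the first line of restaurants.csv.
header = "id,name,property"
config = river_config(header,
                      folder: "/Users/username/work/ldgourmet",
                      index: "ldgourmet",
                      type: "restaurant")
puts JSON.pretty_generate(config)
```

For the real file, the header line could be read with File.open(path, &:readline).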

I do not really understand concurrent_requests, bulk_size, or bulk_threshold, so for now I used the settings from the GitHub README as-is. With import_data.json in place, import the data into the ldgourmet index.


$ curl -XPUT localhost:9200/_river/my_csv_river/_meta -d @import_data.json

One caveat: after this command runs, the CSV River plugin automatically renames the CSV file it used, from restaurants.csv to restaurants.csv.processing.imported. So if you need to start over, copy it back like this.


$ cp restaurants.csv.processing.imported restaurants.csv

I also hit a problem here in my environment. The data went in, but when I checked it in the head plugin (http://127.0.0.1:9200/_plugin/head/), the Japanese text in the data was garbled. If you do not see garbled text, the method above is all you need.

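I have heard that Java on the Mac defaults to SJIS as its character encoding, so maybe that is the cause. I do not know Java well and could neither pin down the cause nor fix it, so in the end I imported the data with the alternative method described next.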
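
For reference, this is what that kind of mismatch looks like in general: UTF-8 bytes decoded as if they were Shift_JIS. The Ruby sketch below only illustrates the suspected mechanism. It does not show that this is what actually happened inside the CSV River plugin, and the function name is made up.

```ruby
# Takes UTF-8 text and decodes its bytes as if they were Shift_JIS (Windows-31J),
# replacing byte sequences that are invalid in that encoding.
def misread_as_sjis(utf8_text)
  utf8_text.dup.force_encoding("Windows-31J")
           .encode("UTF-8", invalid: :replace, undef: :replace)
end

original = "茶屋"
garbled = misread_as_sjis(original)
puts "#{original} -> #{garbled}"
```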

Loading data with the Elasticsearch Index API

Index API

I tried inserting a few records with the API above and confirmed that they went in correctly with no garbled Japanese, so I wrote the following Ruby script.

insert_data.rb

#!/usr/bin/env ruby

require "csv"

CSV.open("ldgourmet/restaurants.csv", "r") do |f|
  f.each_with_index do |item, i|
    next if i == 0
    p item
    `curl -XPUT 'http://localhost:9200/ldgourmet/restaurant/#{item[0]}' -d '
      {
        "id": "#{item[0]}",
        "name": "#{item[1]}",
        "property": "#{item[2]}",
        "alphabet": "#{item[3]}",
        "name_kana": "#{item[4]}",
        "pref_id": "#{item[5]}",
        "area_id": "#{item[6]}",
        "station_id1": "#{item[7]}",
        "station_time1": "#{item[8]}",
        "station_distance1": "#{item[9]}",
        "station_id2": "#{item[10]}",
        "station_time2": "#{item[11]}",
        "station_distance2": "#{item[12]}",
        "station_id3": "#{item[13]}",
        "station_time3": "#{item[14]}",
        "station_distance3": "#{item[15]}",
        "category_id1": "#{item[16]}",
        "category_id2": "#{item[17]}",
        "category_id3": "#{item[18]}",
        "category_id4": "#{item[19]}",
        "category_id5": "#{item[20]}",
        "zip": "#{item[21]}",
        "address": "#{item[22]}",
        "north_latitude": "#{item[23]}",
        "east_longitude": "#{item[24]}",
        "description": "#{item[25]}",
        "purpose": "#{item[26]}",
        "open_morning": "#{item[27]}",
        "open_lunch": "#{item[28]}",
        "open_late": "#{item[29]}",
        "photo_count": "#{item[30]}",
        "special_count": "#{item[31]}",
        "menu_count": "#{item[32]}",
        "fan_count": "#{item[33]}",
        "access_count": "#{item[34]}",
        "created_on": "#{item[35]}",
        "modified_on": "#{item[36]}",
        "closed": "#{item[37]}"
      }
    '`
  end
end
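
One weakness of the script above is that it builds both the shell command and the JSON by string interpolation, so a value containing a single or double quote would break the request. The following is a sketch of a variant that avoids this, not the version I actually ran. It uses JSON.generate for the body and Net::HTTP instead of curl. The method names and the sample row are made up, and empty CSV fields are sent as null rather than as empty strings.

```ruby
require "csv"
require "json"
require "net/http"

# Turns one CSV row into the JSON document body; JSON.generate escapes quotes,
# backslashes, and newlines in the values.
def document_json(row)
  JSON.generate(row.to_h)
end

# PUTs one document to /ldgourmet/restaurant/<id> (needs a running Elasticsearch).
def put_document(http, row)
  request = Net::HTTP::Put.new("/ldgourmet/restaurant/#{row["id"]}")
  request.body = document_json(row)
  http.request(request)
end

# Imports every row of the given CSV file over a single HTTP connection.
# Not called here because it needs a running node and the real file.
def import_csv(path, host: "localhost", port: 9200)
  Net::HTTP.start(host, port) do |http|
    CSV.foreach(path, headers: true) { |row| put_document(http, row) }
  end
end

# Hypothetical row with characters that would break hand-built quoting.
sample = CSV.parse(%(id,name\n1,"Joe's ""Best"" Diner"\n), headers: true).first
puts document_json(sample)
```

Running import_csv("ldgourmet/restaurants.csv") would then replace the curl loop.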

Change the permissions.


$ chmod 755 insert_data.rb

Run the import.


$ ruby insert_data.rb

restaurants.csv has more than 200,000 records, so loading everything takes a while.
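
Sending one request per record is part of why this is slow. Elasticsearch also has a Bulk API (the _bulk endpoint) that accepts many documents in one request. I did not use it or measure it for this import, so treat the following as a sketch of the request body format only. The method name bulk_body and the sample documents are made up.

```ruby
require "json"

# Builds a Bulk API request body: one action line plus one source line per
# document, each terminated by a newline.
def bulk_body(docs, index:, type:)
  lines = docs.flat_map do |doc|
    action = { "index" => { "_index" => index, "_type" => type, "_id" => doc["id"] } }
    [JSON.generate(action), JSON.generate(doc)]
  end
  lines.join("\n") + "\n"
end

# Hypothetical documents standing in for rows of restaurants.csv.
docs = [
  { "id" => "1", "name" => "example one" },
  { "id" => "2", "name" => "example two" }
]
body = bulk_body(docs, index: "ldgourmet", type: "restaurant")
puts body
```

The resulting text would be sent as the body of a POST to /_bulk, a few hundred or thousand documents at a time.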

With that, the data is safely in Elasticsearch. That is all for today. Next time I will write about trying out searches with Elasticsearch queries.

If you are implementing full-text search, I recommend ElasticSearch.

高速スケーラブル検索エンジン ElasticSearch Server

>> Next article: A simple search example using Elasticsearch queries and filters
