Data Lake
/publish
Load data from files in S3 or GCS into the data lake.
- rights: publish
- verbs: POST
| Parameter | Type | Required |
|---|---|---|
| file | String | Yes. Fully qualified file path, e.g. s3://yourbucket/path/to/file.json or gs://yourgcsbucket/path/to/file.json. |
| rules | TransformRuleSet | No |
| format | String | No. One of: txt, csv, json, jsonl. If not present, filename will be inspected for format. |
| compression | String | No. One of: gz, zip. If not present, filename will be inspected. |
| encoding | String | No. Any Java encoding format. Default UTF-8. |
| delimiter | char | No. Delimiter character for delimited file formats. Tab is assumed for txt and comma for csv unless specified. |
| headers | boolean | No. Whether headers are present for delimited formats. Default true. If false, a TransformRuleSet is required. |
| rows | Integer | No. Max rows to load. Negative or omitted loads all rows. |
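When format and compression are omitted, the filename is inspected. A minimal sketch of suffix-based inference (the actual server-side inspection logic is not documented here; `infer_file_type` is a hypothetical helper for illustration):

```shell
# Infer format and compression from a filename's suffixes.
# Hypothetical helper; the real inspection happens server-side.
infer_file_type() {
  name="$1"
  compression=""
  # strip a recognized compression suffix first, if present
  case "$name" in
    *.gz)  compression="gz";  name="${name%.gz}"  ;;
    *.zip) compression="zip"; name="${name%.zip}" ;;
  esac
  # then match the remaining extension against the supported formats
  case "$name" in
    *.jsonl) format="jsonl" ;;
    *.json)  format="json"  ;;
    *.csv)   format="csv"   ;;
    *.txt)   format="txt"   ;;
    *)       format="unknown" ;;
  esac
  echo "format=$format compression=$compression"
}

infer_file_type "s3://m1-public/reddit/us.jsonl.gz"
# format=jsonl compression=gz
```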
```shell
curl https://test-m1.minusonedb.com/publish \
  -d "file=s3://m1-public/reddit/us.jsonl.gz" \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 publish -file "s3://m1-public/reddit/us.jsonl.gz"
```

To load data via /publish you must configure permissions so that the EC2 instance profile role associated with your environment can read the file(s) you are attempting to /publish.
```
// Example ReadAccess policy for S3 configuration
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::248899197673:role/$instanceProfileRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::$yourBucket",
        "arn:aws:s3:::$yourBucket/*"
      ]
    }
  ]
}
```

Configuration Steps When Publishing from S3
1. Run /env/bucket/register to register the bucket with your environment.
2. In your AWS account, configure a bucket policy so that the instance profile role of your environment has read access to your data. You can obtain the instanceProfileRole from /env/get.
3. Validate your bucket access by calling /publish with one of your files and rows=0. If you get an AccessDeniedException, recheck your IAM permissions above.
4. You can now call /publish on all the files you wish to load into your environment.
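Step 4 lends itself to scripting. A sketch that builds one /publish call per file, assuming the bucket has already been registered and validated; `publish_cmd` is a hypothetical wrapper that echoes each command so the calls can be reviewed first (drop the leading `echo` behavior, or pipe the output to `sh`, to actually send them):

```shell
# Print one fully-formed /publish curl command per file argument.
# Echoed rather than executed so the calls can be reviewed first.
ENV_URL="https://test-m1.minusonedb.com"

publish_cmd() {
  for f in "$@"; do
    echo "curl $ENV_URL/publish -d file=$f -H \"m1-auth-token: \$myToken\""
  done
}

publish_cmd "s3://m1-public/reddit/us.jsonl.gz" "s3://m1-public/reddit/uk.jsonl.gz"
```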
```shell
# create a new service account
gcloud --configuration dev iam service-accounts create whateveryouwant --display-name="Whatever you want"

# serviceAccountEmail is the email address of the service account you created above
# yourProjectId is the project in which you created your service account
gcloud --configuration dev projects add-iam-policy-binding $yourProjectId \
  --member="serviceAccount:$serviceAccountEmail" \
  --role="roles/storage.objectViewer"
```

Configuration Steps When Publishing from GCS
1. Create a service account in your Google Cloud account and allow it to read/download files in your bucket(s).
   (You can skip this step if you already have a service account with sufficient access to the data you wish to load into your m1db environment.)
2. Create a trust relationship between the EC2 instance profile role associated with your environment and your service account
```shell
# create a trust relationship between your service account and your environment instance role
# m1dbEnvironmentInstanceProfileRole is the "instanceProfileRole" attribute in the response returned by "m1 ops env/get -env your-env"
gcloud --configuration dev iam service-accounts add-iam-policy-binding $serviceAccountEmail \
  --role=roles/iam.workloadIdentityUser \
  --member="principalSet://iam.googleapis.com/projects/980494489932/locations/global/workloadIdentityPools/aws-pool/attribute.aws_role/arn:aws:sts::248899197673:assumed-role/$m1dbEnvironmentInstanceProfileRole"
```

3. Set the gcs-service-account system property to the email address of your service account.
```shell
# Set the gcs-service-account system property to point to your service account
m1 your-env system -gcs-service-account "$yourAccountEmail"
```

4. Enable outbound connectivity for your environment.
```shell
# Enable outbound connectivity for your environment
m1 ops env/outbound -env your-env -enable true
```

5. Validate your GCS bucket access by calling /publish with one of your files and rows=0. If you get an access error, recheck your configuration steps. Note that it may take a few minutes for your configuration to take effect.
6. You can now call /publish on all the files you wish to load into your environment.
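The `--member` string in the trust-relationship step above is assembled from fixed m1db identifiers (the project number, workload identity pool, and AWS account shown in the example) plus your environment's instance profile role. A sketch of that assembly, with a placeholder role name:

```shell
# Assemble the workload-identity member string used in the
# add-iam-policy-binding step. The pool and account IDs are the fixed
# m1db values from the example above; only the role name is yours.
m1dbEnvironmentInstanceProfileRole="your-env-instance-role"   # placeholder; take this from /env/get

member="principalSet://iam.googleapis.com/projects/980494489932/locations/global/workloadIdentityPools/aws-pool/attribute.aws_role/arn:aws:sts::248899197673:assumed-role/$m1dbEnvironmentInstanceProfileRole"
echo "$member"
```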
/modify
Modify the data lake with inserts, updates, or deletions. This method is a generalization of /insert, /update, and /delete. Beware of bulk updates spanning many different files; updating many documents at once will take much longer.
- rights: publish
- verbs: POST
| Parameter | Type | Required |
|---|---|---|
| e | String[] | No. Raw parameter entities |
| delete | String[] | No. Raw ids to delete |
- returns: JSON [{},...]
```shell
curl https://test-m1.minusonedb.com/modify \
  -d 'e=[{"score":"199","downs":"10","author":"Alice"}]&delete=["82270000"]' \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 modify -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]' -delete '["82270000"]'
```

/insert
Insert raw entities into the lake.
- rights: publish
- verbs: POST
| Parameter | Type | Required |
|---|---|---|
| e | String[] | No. Raw parameter entities |
- returns: JSON [{},...]
```shell
curl https://test-m1.minusonedb.com/insert \
  -d 'e=[{"score" : "199", "downs" : "10", "author" : "Alice"}]' \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 insert -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]'
```

/update
Update documents in the lake by passing entities tagged with their _m1key. Beware of bulk updates spanning many different files; updating many documents at once will take much longer.
- rights: delete
- verbs: POST
| Parameter | Type | Required |
|---|---|---|
| e | String[] | No. Raw parameter entities |
- returns: JSON [{},...]
```shell
curl https://test-m1.minusonedb.com/update \
  -d 'e=[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]' \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 update -e '[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]'
```

/delete
Delete records from the data lake for the specified list of _m1key ids.
- rights: delete
- verbs: POST
| Parameter | Type | Required |
|---|---|---|
| ids | String[] | No. List of m1key ids to be deleted |
- returns: JSON [{},...]
```shell
curl https://test-m1.minusonedb.com/delete \
  -d 'ids=["82270000"]' \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 delete -ids '["82270000"]'
```

/next
Retrieve the next _m1key that will be assigned to a document added to the lake (via /publish, for example).
- rights: admin, publish
- verbs: GET
- parameters: none
```shell
curl https://test-m1.minusonedb.com/next \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 next
```

/get
Retrieve any number of rows from the data lake by their _m1key property.
- rights: get
- verbs: GET, POST
| Parameter | Type | Required |
|---|---|---|
| ids | long[] | Yes. IDs of records to be retrieved. |
| properties | Array | No. List of properties from schema to include in records. If null, all columns are returned. |
- returns: JSON [{},...]
```shell
curl https://test-m1.minusonedb.com/get \
  -d 'ids=[10000,20000,30000]&properties=["_m1key","session.id"]' \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 get -ids "[10000,20000,30000]" -properties '["_m1key","session.id"]'
```

/range
Retrieve all rows from the data lake with _m1key values between start (inclusive) and end (exclusive).
- rights: get
- verbs: GET, POST
| Parameter | Type | Required |
|---|---|---|
| start | long | Yes. Inclusive. |
| end | long | Yes. Exclusive. |
| properties | Array | No. List of properties from schema to include in records. If null, all columns are returned. |
- returns: JSON [{},...]
```shell
curl https://test-m1.minusonedb.com/range \
  -d 'start=10000&end=30000&properties=["_m1key","session.id"]' \
  -H "m1-auth-token: $myToken"
```

```shell
m1 test-m1 range -start 10000 -end 30000
```
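Because start is inclusive and end exclusive, adjacent windows tile the key space without gaps or overlap, which makes /range convenient for paging through the lake in fixed-size chunks. A sketch of the window arithmetic; `range_windows` is a hypothetical helper that echoes each call for review rather than executing it (the auth header is omitted here):

```shell
# Page through _m1keys in [first, last) using fixed-size /range windows.
# Half-open windows mean consecutive calls neither skip nor repeat a key.
range_windows() {
  first=$1; last=$2; step=$3
  s=$first
  while [ "$s" -lt "$last" ]; do
    e=$((s + step))
    # clamp the final window to the requested end
    [ "$e" -gt "$last" ] && e=$last
    echo "curl https://test-m1.minusonedb.com/range -d 'start=$s&end=$e'"
    s=$e
  done
}

range_windows 10000 30000 10000
```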

