
Data Lake

/publish

Load data from files in S3 or GCS into the data lake.

  • rights: publish
  • verbs: POST
  • file (String): Required. Fully qualified file path, e.g. s3://yourbucket/path/to/file.json or gs://yourgcsbucket/path/to/file.json.
  • rules (TransformRuleSet): Optional.
  • format (String): Optional. One of txt, csv, json, jsonl. If not present, the filename is inspected for the format.
  • compression (String): Optional. One of gz, zip. If not present, the filename is inspected.
  • encoding (String): Optional. Any Java encoding name. Default UTF-8.
  • delimiter (char): Optional. Delimiter character for delimited file formats. Tab and comma are assumed for txt and csv unless specified.
  • headers (boolean): Optional. Whether headers are present for delimited formats. Default true. If false, a TransformRuleSet is required.
  • rows (Integer): Optional. Maximum number of rows to load. A negative or omitted value loads all rows.
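
When format and compression are omitted, the filename is inspected. The exact suffix rules are not documented beyond the parameter descriptions above; here is a minimal client-side sketch of that inference, under the assumption that the last one or two filename extensions decide it:

```python
import os

# Hypothetical sketch of how format and compression might be inferred
# from a filename when the parameters are omitted. The suffix tables are
# assumptions based on the parameter descriptions above, not server code.
COMPRESSIONS = {".gz": "gz", ".zip": "zip"}
FORMATS = {".txt": "txt", ".csv": "csv", ".json": "json", ".jsonl": "jsonl"}

def infer(filename):
    """Return (format, compression) inferred from a file path."""
    base, ext = os.path.splitext(filename)
    compression = COMPRESSIONS.get(ext)
    if compression:
        # strip the compression suffix and look at the next extension
        base, ext = os.path.splitext(base)
    return FORMATS.get(ext), compression

# infer("s3://m1-public/reddit/us.jsonl.gz") yields ("jsonl", "gz")
```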
bash
curl https://test-m1.minusonedb.com/publish \
-d "file=s3://m1-public/reddit/us.jsonl.gz" \
-H "m1-auth-token: $myToken"
bash
m1 test-m1 publish -file "s3://m1-public/reddit/us.jsonl.gz"
200 OK: Publish progress

To load data via /publish you must configure permissions so that the EC2 instance profile role associated with your environment can read the file(s) you are attempting to /publish.

json
// Example ReadAccess policy for S3 Configuration
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::248899197673:role/$instanceProfileRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::$yourBucket",
        "arn:aws:s3:::$yourBucket/*"
      ]
    }
  ]
}

Configuration Steps When Publishing from S3

1. Run /env/bucket/register to register the bucket with your environment

2. In your AWS account, configure a bucket policy so that the instance profile role of your environment has read access to your data. You can obtain the instanceProfileRole from /env/get

3. Validate your bucket access by calling /publish with one of your files and rows=0. If you get an AccessDeniedException, recheck your IAM permissions above.

4. You can now call /publish on all the files you wish to load into your environment.
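
If you register several buckets, it may help to generate the bucket policy from a template rather than hand-editing JSON. A minimal sketch, where the role name comes from /env/get and the bucket name is yours:

```python
import json

def bucket_policy(instance_profile_role, bucket):
    """Build the ReadAccess bucket policy shown above for a given
    instance profile role (from /env/get) and bucket name."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ReadAccess",
            "Effect": "Allow",
            "Principal": {
                "AWS": f"arn:aws:iam::248899197673:role/{instance_profile_role}"
            },
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }],
    }, indent=2)
```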

Configuration Steps When Publishing from GCS

1. Create a service account in your Google Cloud account and allow it to read/download files in your bucket(s).

(You can skip this step if you already have a service account with sufficient access to the data you wish to load into your m1db environment.)

bash
# create a new service account
gcloud --configuration dev iam service-accounts create whateveryouwant --display-name="Whatever you want"
# serviceAccountEmail is the email address of the service account you created above
# yourProjectId is the project in which you created your service account
gcloud --configuration dev projects add-iam-policy-binding $yourProjectId \
          --member="serviceAccount:$serviceAccountEmail" \
          --role="roles/storage.objectViewer"

2. Create a trust relationship between the EC2 instance profile role associated with your environment and your service account.

bash
# create a trust relationship between your service account and your environment instance role
# m1dbEnvironmentInstanceProfileRole is the "instanceProfileRole" attribute in the response returned by "m1 ops env/get -env your-env"
gcloud --configuration dev iam service-accounts add-iam-policy-binding $serviceAccountEmail \
--role=roles/iam.workloadIdentityUser \
--member="principalSet://iam.googleapis.com/projects/980494489932/locations/global/workloadIdentityPools/aws-pool/attribute.aws_role/arn:aws:sts::248899197673:assumed-role/$m1dbEnvironmentInstanceProfileRole"

3. Set the gcs-service-account system property to the email address of your service account.

bash
# Set the gcs-service-account system property to point to your service account
m1 your-env system -gcs-service-account "$yourAccountEmail"

4. Enable outbound connectivity for your environment

bash
# Enable outbound connectivity for your environment
m1 ops env/outbound -env your-env -enable true

5. Validate your GCS bucket access by calling /publish with one of your files and rows=0. If you get an access error, recheck your configuration steps. Note that it may take a few minutes for your configuration to take effect.

6. You can now call /publish on all the files you wish to load into your environment.

/modify

Modify the data lake with inserts, updates, or deletions. This method is a generalization of /insert, /update, and /delete. Beware of bulk updates that span many different files; updating many documents at once will take much longer.

  • rights: publish
  • verbs: POST
  • e (String[]): Optional. Raw parameter entities.
  • delete (String[]): Optional. Raw ids to delete.
  • returns: JSON [{},...]
bash
curl https://test-m1.minusonedb.com/modify \
-d 'e=[{"score":"199","downs":"10","author":"Alice"}]&delete=["82270000"]' \
-H "m1-auth-token: $myToken"
bash
m1 test-m1 modify -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]' -delete '["82270000"]'
200 OK: Retrieved records
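
The e and delete values are JSON arrays passed as ordinary form parameters, as the curl -d example shows. A sketch of building the same body in Python (the entity fields are just the sample data from above):

```python
import json
from urllib.parse import urlencode

# Build the form-encoded body for /modify. Each parameter value is a
# JSON-serialized array, matching the curl example above.
entities = [{"score": "199", "downs": "10", "author": "Alice"}]
deletions = ["82270000"]

body = urlencode({
    "e": json.dumps(entities),
    "delete": json.dumps(deletions),
})
# body is then POSTed with the m1-auth-token header
```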

/insert

Insert raw entities into the lake.

  • rights: publish
  • verbs: POST
  • e (String[]): Optional. Raw parameter entities.
  • returns: JSON [{},...]
bash
curl https://test-m1.minusonedb.com/insert \
-d 'e=[{"score" : "199", "downs" : "10", "author" : "Alice"}]' -H "m1-auth-token: $myToken"
bash
m1 test-m1 insert -e '[{"score" : "199", "downs" : "10", "author" : "Alice"}]'
200 OK: Retrieved records

/update

Update documents in the lake by passing in entities associated with their _m1key. Beware of bulk updates that span many different files; updating many documents at once will take much longer.

  • rights: delete
  • verbs: POST
  • e (String[]): Optional. Raw parameter entities.
  • returns: JSON [{},...]
bash
curl https://test-m1.minusonedb.com/update \
-d 'e=[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]' -H "m1-auth-token: $myToken"
bash
m1 test-m1 update -e '[{"_m1key" : "82270000", "score" : "199", "downs" : "10", "author" : "Alice"}]'
200 OK: Retrieved records

/delete

Delete records from the data lake for the specified list of _m1key ids.

  • rights: delete
  • verbs: POST
  • ids (String[]): Optional. List of _m1key ids to be deleted.
  • returns: JSON [{},...]
bash
curl https://test-m1.minusonedb.com/delete \
-d 'ids=["82270000"]' \
-H "m1-auth-token: $myToken"
bash
m1 test-m1 delete -ids '["82270000"]'
200 OK: Retrieved records

/next

Retrieve the next _m1key that will be assigned to a document added to the lake (via /publish, for example).

  • rights: admin, publish
  • verbs: GET
  • parameters: none
bash
curl https://test-m1.minusonedb.com/next \
-H "m1-auth-token: $myToken"
bash
m1 test-m1 next
200 OK: Next available key

/get

Retrieve any number of rows from the data lake via the _m1key property.

  • rights: get
  • verbs: GET, POST
  • ids (long[]): Required. IDs of the records to be retrieved.
  • properties (Array): Optional. List of properties from the schema to include in the records. If null, all columns are returned.
  • returns: JSON [{},...]
bash
curl https://test-m1.minusonedb.com/get \
-d 'ids=[10000,20000,30000]&properties=["_m1key","session.id"]' \
-H "m1-auth-token: $myToken"
bash
m1 test-m1 get -ids "[10000,20000,30000]" -properties '["_m1key","session.id"]'
200 OK: Retrieved records
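
Because /get accepts any number of ids, very large retrievals are often easier to manage as a series of smaller calls. A client-side sketch of splitting an id list into batches (the batch size of 1000 is an arbitrary choice, not a documented server limit):

```python
def batches(ids, size=1000):
    """Split a list of _m1key ids into fixed-size batches,
    one batch per /get call."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

# batches([10000, 20000, 30000], size=2) yields [[10000, 20000], [30000]]
```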

/range

Retrieve all rows from the data lake with _m1key values between start (inclusive) and end (exclusive).

  • rights: get
  • verbs: GET, POST
  • start (long): Required. Inclusive.
  • end (long): Required. Exclusive.
  • properties (Array): Optional. List of properties from the schema to include in the records. If null, all columns are returned.
  • returns: JSON [{},...]
bash
curl https://test-m1.minusonedb.com/range \
-d 'start=10000&end=30000&properties=["_m1key","session.id"]' \
-H "m1-auth-token: $myToken"
bash
m1 test-m1 range -start 10000 -end 30000
200 OK: Records in range
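
Since start is inclusive and end is exclusive, consecutive /range windows can share a boundary key without overlap or gaps. A sketch of generating such windows for scanning a key range in fixed-size chunks:

```python
def windows(start, end, size):
    """Split the half-open interval [start, end) into consecutive
    [lo, hi) windows, each a (start, end) pair for one /range call."""
    return [(lo, min(lo + size, end)) for lo in range(start, end, size)]

# windows(10000, 30000, 10000) yields [(10000, 20000), (20000, 30000)]
```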

© 2021-2026 MinusOne, Inc.